Abstract: The Apache Accumulo database excels at distributed
storage and indexing and is ideally suited for storing
graph data. Many big data analytics compute on graph data
and persist their results back to the database. These graph
calculations are often best performed inside the database server.
The GraphBLAS standard provides a compact and efficient basis
for a wide range of graph applications through a small number
of sparse matrix operations. In this article, we discuss a
server-side implementation of GraphBLAS sparse matrix multiplication
that leverages Accumulo’s native, high-performance iterators.
We compare the mathematics and performance of inner and
outer product implementations, and show how an outer product
implementation achieves optimal performance near Accumulo’s
peak write rate. We offer our work as a core component to the
Graphulo library that will deliver matrix math primitives for
graph analytics within Accumulo.

I. Introduction 

The Apache Accumulo NoSQL database was designed for 
high performance ingest and scans [1]. While fast ingest and 
scans solve some big data problems, more complex scenarios 
involve performing tasks such as data enrichment, graph
algorithms and clustering analytics. These techniques often require
moving data from a database to a compute node. The ability to 
compute directly in a database can lead to benefits including 
data locality, infrastructure reuse and selective access. 

Accumulo administrators commonly create data locality by 
running server processes on the physical nodes where data is 
stored and cached. Computing within Accumulo takes advantage
of this locality by avoiding unnecessary network transfer,
effectively moving “compute to data” like a stored procedure,
in contrast to client-server models that move “data to compute”.
Performing computation inside Accumulo also reuses
its distributed infrastructure such as write-ahead logging,
fault-tolerant execution, and parallel load balancing of data. In
particular, Accumulo’s infrastructure enables selective access 
to data along its indexed attributes (rows), which enhances the 
performance of algorithms written with row access patterns. 

There are a variety of ways to store graphs in Accumulo. 
One common schema is to store graphs as sparse matrices. 
Researchers in the GraphBLAS forum [2] have identified a 
small set of kernels that form a basis for matrix algorithms
useful for graphs when represented as sparse matrices. This
article presents Graphulo, an effort to realize the GraphBLAS 
primitives that enable algorithms using matrix mathematics 
directly in Accumulo servers [3]. 

In this paper we focus on Sparse Generalized Matrix Multiply
(SpGEMM), the core kernel at the heart of GraphBLAS.
Many GraphBLAS primitives can be expressed in terms of
SpGEMM via user-defined multiplication and addition functions.
SpGEMM can be used to implement a wide range of
algorithms from graph search [4] to table joins [5] and many 
others (see introduction of [6]). 

We call our implementation of SpGEMM in Accumulo 
TableMult, short for multiplication of Accumulo tables. 
Accumulo tables have many similarities to sparse matrices, 
though a more precise mathematical definition is Associative 
Arrays [7]. For this work, we concentrate on large distributed
tables that may not fit in memory and use a streaming approach 
that leverages Accumulo’s built-in distributed infrastructure. 

We are particularly interested in Graphulo for queued
analytics [8], that is, analytics on selected table subsets. Queued
analytics maximally leverage databases by quickly accessing 
subsets of interest, whereas whole-table analytics may perform 
better on parallel file systems such as Lustre or Hadoop. We 
therefore prioritize smaller problems that require low latency 
to enable analysts to explore graph data interactively. 

We review Accumulo and its model for server-side computation,
iterator stacks, in Section I-A. We define matrix
multiplication and compare inner and outer product methods in
Section II-A, settling on outer product for implementing
TableMult. We show TableMult’s design as Accumulo iterators in
Section II-B and test TableMult’s scalability with experiments 
in Section III. We discuss related work, design alternatives and 
optimizations in Section IV and conclude in Section V. 

A. Primer: Accumulo and its Iterator Stack 

Accumulo stores data in Hadoop RFiles as byte arrays
indexed by key using (key, value) pairs called entries. Keys
decompose further into 5-tuples consisting of a row, column
family, column qualifier, visibility and timestamp. For simplicity,
we focus on a 2-tuple key consisting of a row and
column qualifier. Entries belong to tables, which Accumulo
divides into tablets and assigns to tablet servers. Client
applications write new entries via BatchWriters and retrieve entries
sequentially via Scanners or in parallel via BatchScanners.



Accumulo’s server-side programming model runs an iterator
stack on tablets in range of a scan. An iterator stack is
a set of data streams originating at Accumulo’s data sources
for a specific tablet (Hadoop RFiles and cached in-memory
maps), converging together in merge-sorts, flowing through
each iterator in the stack and at the end, sending entries to the
client. Iterators themselves are Java classes implementing the
SortedKeyValueIterator (SKVI) interface.

Developers add custom logic for server-side computation by 
writing new iterators and plugging them into the iterator stack. 
In return for fitting their computation in the SKVI paradigm, 
developers gain distributed parallelism for free as Accumulo 
runs their iterators on relevant tablets simultaneously. 
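The flow of entries through an iterator stack can be pictured with a small Python sketch. It is only an analogy, not Accumulo code: the names `merge_sources` and `value_filter` and the sample entries are hypothetical, and the sketch models neither seeking nor the real SortedKeyValueIterator interface. Two sorted sources are merge-sorted and then passed through one filtering iterator.

```python
import heapq

def merge_sources(*sources):
    """Merge already-sorted (key, value) streams into one sorted stream."""
    return heapq.merge(*sources, key=lambda entry: entry[0])

def value_filter(source, predicate):
    """A reduction-style iterator: drops entries whose value fails the predicate."""
    for key, value in source:
        if predicate(value):
            yield key, value

# Hypothetical data sources for one tablet; keys are (row, column qualifier).
rfile = [(("r1", "c1"), 5), (("r2", "c1"), 0), (("r3", "c2"), 7)]
memory_map = [(("r1", "c2"), 0), (("r2", "c2"), 3)]

# The "stack": sources converge in a merge-sort, then flow through the filter.
stack = value_filter(merge_sources(rfile, memory_map), lambda v: v != 0)
result = list(stack)
```

Because the filter only removes entries, the output stays in sorted key order, which is the property the end of a real iterator stack is expected to preserve.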

SKVIs are reminiscent of built-in Java iterators in that they 
hold state and emit one entry at a time until finished iterating. 
However, they are more powerful than Java iterators in that 
they can seek to arbitrary positions in the data stream. They 
also have two constraints: the end of the iterator stack should 
emit entries in sorted order, and iterators must not maintain 
volatile state such as threads, open files or sockets because 
Accumulo may destroy, re-create and re-seek an iterator stack 
between function calls without allowing time to clean up. 

Iterators are most commonly used for “reduction” operations
that transform or eliminate entries passing through.
The Accumulo community generally discourages “generator” 
iterators that emit new entries not present in data sources 
because they are easy to misuse and violate SKVI constraints 
by emitting entries out of order or relying on volatile state. 
In this work, we suggest a new pattern for iterator usage as 
a conduit for client write operations that achieves the benefits 
of generator iterators while avoiding their constraints. 

II. TableMult Design 
A. Matrix Multiplication 

Given matrices A of size N × M, B of size M × L, and
operations ⊕ and ⊗ for element-wise addition and multiplication,
the matrix product C = A ⊕.⊗ B, or more shortly
C = AB, defines entries of result matrix C as

$$C(i,j) = \bigoplus_{k=1}^{M} A(i,k) \otimes B(k,j)$$

We call intermediary results of ⊗ operations partial products.

For the sake of sparse matrices, we only perform ⊕ and ⊗
when both operands are nonzero, an optimization stemming
from requiring that 0 is an additive identity such that a ⊕ 0 =
0 ⊕ a = a, and that 0 is a multiplicative annihilator such that
a ⊗ 0 = 0 ⊗ a = 0. Without these conditions, zero operands
could generate nonzero results that destroy sparsity.
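To make the two conditions concrete, the following Python sketch (the function name `sparse_dot` and the sample vectors are illustrative assumptions, not part of TableMult) evaluates one element of C by touching only indices stored in both sparse operands. This is valid for the usual (+, ×) pair with zero 0, and for (min, +) with zero taken as infinity, since infinity is the identity of min and the annihilator of +.

```python
import math
import operator

def sparse_dot(row, col, add, mul):
    """Reduce mul(a, b) with add over indices stored in BOTH sparse vectors.
    Returns None when no index matches, i.e. the result is an implicit zero."""
    total = None
    for k in sorted(row.keys() & col.keys()):
        product = mul(row[k], col[k])
        total = product if total is None else add(total, product)
    return total

a_row = {0: 2.0, 3: 5.0}   # stored (nonzero) entries of one row of A
b_col = {3: 4.0, 7: 1.0}   # stored (nonzero) entries of one column of B

plus_times = sparse_dot(a_row, b_col, operator.add, operator.mul)
min_plus = sparse_dot(a_row, b_col, min, operator.add)

# For (min, +) the "zero" is infinity: identity for min, annihilator for +.
identity_holds = min(3.0, math.inf) == 3.0
annihilator_holds = 3.0 + math.inf == math.inf
```

Only index 3 is stored in both vectors, so a single partial product is formed in each case; skipping the other indices is safe precisely because of the identity and annihilator conditions.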

We study two well known patterns for computing matrix 
multiplication: inner product and outer product [9]. They differ 
in the order in which they perform the ⊕ and ⊗ operations.
The more common inner product approach runs the following: 

    for i = 1:N
      for j = 1:L
        C(i,j) ⊕= A(i,:) B(:,j)


performing inner product on vectors. For easier comparison, 
we rewrite the above approach with summation deferred as: 

    for i = 1:N
      for j = 1:L
        for k = 1:M
          C(i,j) ⊕= A(i,k) ⊗ B(k,j)

Inner product has the advantage of generating entries in 
sorted order: the third-level loop generates all partial products 
needed to compute a particular element C(i,j) consecutively.
The ⊕ applies immediately after each third-level loop to obtain
an element in C. Inner product is therefore easy to “pre-sum,” 
an Accumulo term for applying a Combiner locally before 
sending entries to a remote but globally-aware table Combiner. 
Emitting sorted entries also facilitates inner product use in 
standard iterator stacks and easier operation pipelining. 
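A minimal in-memory sketch of the inner product order, assuming dictionary-of-dictionaries storage (the name `inner_product` and the 2 × 2 inputs are hypothetical), shows both properties described above: elements come out in sorted (i, j) order and each is already fully summed when emitted.

```python
def inner_product(A, B_cols, N, L):
    """A: {i: {k: value}} stored by row; B_cols: {j: {k: value}} stored by column.
    Yields ((i, j), value) in sorted order, each value already fully summed."""
    for i in range(N):
        a_row = A.get(i, {})
        for j in range(L):                    # one full sweep over B per row of A
            b_col = B_cols.get(j, {})
            shared = a_row.keys() & b_col.keys()
            if shared:
                yield (i, j), sum(a_row[k] * b_col[k] for k in shared)

A = {0: {0: 1, 1: 2}, 1: {1: 3}}              # A = [[1, 2], [0, 3]]
B_cols = {0: {0: 4}, 1: {0: 5, 1: 6}}         # B = [[4, 5], [0, 6]]
entries = list(inner_product(A, B_cols, N=2, L=2))
```

The inner `for j` loop is also where the cost appears: every row of A triggers another sweep over all of B.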

Despite inner product’s order-preserving advantages, outer 
product performs better for sparse matrices because it passes 
through A and B only once [10] [11]. Inner product’s second- 
level loop repeats a scan over all of B for each row of A. 
Under our assumption that we cannot fit B entirely in memory, 
multiple passes over B translate to multiple Accumulo scans 
that each require a disk read. We found in performance tests 
that an outer product approach performs an order of magnitude 
better than an inner product approach. 

The outer product approach runs the following: 

    for k = 1:M
      C ⊕= A(:,k) B(k,:)

performing outer product on vectors that corresponds to many 
elements of C. Unfolding outer product reveals them as: 

    for k = 1:M
      for i = 1:N
        for j = 1:L
          C(i,j) ⊕= A(i,k) ⊗ B(k,j)

Compared to inner product, outer product moves the k loop 
above the i and j loops that determine position in C. The 
switch results in generating partial products out of order. 

On the other hand, outer product only requires a single pass 
over both input matrices. This is because the top-level k loop 
fixes a dimension of both A and B. Once we finish processing 
a column of A and row of B, we never need read them again. 
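The same idea in outer product order can be sketched as follows, again with hypothetical names and toy inputs and with A held as its transpose so that k indexes rows of both inputs. Each k is visited once, and the partial products arrive unsummed and out of (i, j) order, so a separate summation step is needed.

```python
def outer_product(A_T, B):
    """A_T: {k: {i: value}} (rows of A's transpose); B: {k: {j: value}} by row.
    Yields unsummed partial products ((i, j), value) in a single pass."""
    for k in sorted(A_T.keys() & B.keys()):   # each k is visited exactly once
        for i, a in sorted(A_T[k].items()):
            for j, b in sorted(B[k].items()):
                yield (i, j), a * b

A_T = {0: {1: 1}, 1: {0: 2, 1: 3}}            # A = [[0, 2], [1, 3]]
B = {0: {0: 4, 1: 5}, 1: {1: 6}}              # B = [[4, 5], [0, 6]]
partials = list(outer_product(A_T, B))

# Deferred summation: fold partial products into C in whatever order they arrive.
C = {}
for position, value in partials:
    C[position] = C.get(position, 0) + value
```

Here position (1, 1) receives two partial products from different values of k, separated by an entry for (0, 1), which illustrates why the output cannot simply be emitted as a sorted stream.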

In terms of memory usage, outer product works best when 
either the matching row or column fits in memory. If neither
fits, we could run the algorithm with a “no memory
assumption” streaming approach by re-reading B’s rows while 
streaming through A’s columns (or vice versa by symmetry 
of i and j), perhaps at the cost of extra disk reads. 

Because k runs along A’s second dimension and Accumulo 
uses row-oriented data layouts, we implement TableMult to 
operate on A’s transpose Aᵀ.

B. TableMult Iterators 

TableMult uses three iterators placed on a BatchScan of
table B: RemoteSourceIterator, TwoTableIterator and
RemoteWriteIterator. A BatchScanner directs Accumulo to run
the iterators on B’s tablets in parallel.



The key idea behind the TableMult iterators is that they 
divert normal dataflow by opening a BatchWriter, redirecting 
entries out-of-band to C via Accumulo’s unsorted ingest 
channel. The scan itself emits no entries except for a small 
number of “monitoring entries” that inform the client about 
TableMult progress. We permit multi-table iterator dataflow 
by opening Scanners that read remote Accumulo tables out- 
of-band. Scanners and BatchWriters are standard tools for 
Accumulo clients; by creating them inside iterators, we enable 
client-side processing patterns within tablet servers. 

Underlying our use of iterators, Scanners and BatchWriters
are Accumulo’s standing thread pools. Thread pools fulfill our 
low latency requirement by executing upon receiving a request 
at no more expense than a context switch. Scaling up may 
require tuning thread pool size to balance thread contention. 

We illustrate TableMult’s data flow in Figure 1, placing a 
Scanner on table Aᵀ and a BatchWriter on result table C.



Fig. 1: Data flow through the TableMult iterator stack


1) RemoteSourceIterator: RemoteSourceIterator scans an
Accumulo table (not necessarily in the same cluster) using 
credentials passed from the client through iterator options. 

We also use iterator options to specify row and column 
subsets, encoding them in a string format similar to that in 
D4M [12]. Row subsets are straightforward since Accumulo 
uses row-oriented indexing. Column subsets can be implemented
with filter iterators but do not improve performance
since Accumulo must read every column from disk. We 
encourage users to maintain a transpose table using strategies 
similar to the D4M Schema [13] for cases requiring column 
indexing. 

Multiplying table subsets is crucial for queued analytics on
selected rows. However, for simpler performance evaluation,
our experiments in Section III multiply whole tables.

2) TwoTableIterator: TwoTableIterator reads from two iterator
sources, one for Aᵀ and one for B, and performs the
core operations of the outer product algorithm in three phases:

1) Align Rows. Read entries from Aᵀ and B until they
advance to a matching row or one runs out of entries.
We skip non-matching rows since they would multiply
with an all-zero row that, by Section II-A’s assumptions,
generates all-zero partial products.

2) Cartesian product. Read both matching rows into an
in-memory data structure. Initialize an iterator that emits
pairs of entries from the rows’ Cartesian product.

3) Multiply. Pass pairs of entries to ⊗ and emit results.

A client defines ⊗ by specifying a class that implements a
multiply interface. For our experiments we implement ⊗ as
java.math.BigDecimal multiplication, which guarantees correctness
under large or precise real numbers. BigDecimal
decoding did not noticeably impact performance.
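The three phases can be sketched over two row-sorted streams in Python. This is an illustrative analogy rather than the Java iterator itself: `two_table`, the (row, column, value) tuple layout and the sample data are assumptions, and Python's `Decimal` stands in for BigDecimal.

```python
from decimal import Decimal
from itertools import groupby, product
from operator import itemgetter, mul

def two_table(at_entries, b_entries, multiply):
    """at_entries, b_entries: iterables of (row, column, value) sorted by row.
    Yields ((column_of_AT, column_of_B), product) partial products."""
    at_rows = groupby(at_entries, key=itemgetter(0))
    b_rows = groupby(b_entries, key=itemgetter(0))
    at_row, b_row = next(at_rows, None), next(b_rows, None)
    while at_row is not None and b_row is not None:
        # Phase 1: align rows, skipping rows present in only one table.
        if at_row[0] < b_row[0]:
            at_row = next(at_rows, None)
        elif at_row[0] > b_row[0]:
            b_row = next(b_rows, None)
        else:
            # Phase 2: hold both matching rows in memory.
            left, right = list(at_row[1]), list(b_row[1])
            # Phase 3: multiply every pair in the rows' Cartesian product.
            for (_, i, a), (_, j, b) in product(left, right):
                yield (i, j), multiply(a, b)
            at_row, b_row = next(at_rows, None), next(b_rows, None)

at = [("k1", "a", Decimal("2")), ("k2", "a", Decimal("3")),
      ("k2", "b", Decimal("1")), ("k4", "c", Decimal("7"))]
b = [("k2", "x", Decimal("10")), ("k3", "y", Decimal("5")),
     ("k4", "x", Decimal("2"))]
partial_products = list(two_table(at, b, mul))
```

Rows k1 and k3 appear in only one input and are skipped without any multiplication, matching the alignment step described in the list.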

3) RemoteWriteIterator: RemoteWriteIterator writes entries
to a remote Accumulo table using a BatchWriter. Entries
do not have to be in sorted order because Accumulo sorts 
incoming entries as part of its ingest process. 

Barring extreme events such as exceptions in the iterator 
stack or thread death, we designed RemoteWriteIterator to
maintain correctness, such that entries generated from its 
source write to the remote table once. We accomplish this by 
performing all BatchWriter operations within a single function 
call before ceding thread control back to the tablet server. 

A performance concern remains when multiplying a subset 
of the input tables’ rows that consists of many disjoint ranges, 
such as one million “singleton” ranges spanning one row each. 
It is inefficient to flush the BatchWriter before returning from 
each seek call, which happens once per disjoint scan range. 
We accommodate this case by “transferring seek control” from 
the tablet server to RemoteWriteIterator via the same strategy
used in RemoteSourceIterator for seeking within an iterator.

We include an option to BatchWrite C’s transpose Cᵀ
in place of or alongside C. Writing Cᵀ facilitates chaining
TableMults together and maintenance of transpose indexing. 

4) Lazy ⊕: We lazily sum partial products by placing
a Combiner subclass implementing BigDecimal addition on
table C at scan, minor and major compaction scopes. Thus,
⊕ occurs sometime after RemoteWriteIterator writes partial
products to C yet necessarily before entries from C may 
be seen such that we always achieve correctness. Summation 
could happen when Accumulo flushes C’s entries cached in 
memory to a new RFile, when Accumulo compacts RFiles 
together, or when a client scans C. 

The key algebraic requirement for implementing ⊕ inside
a Combiner is that ⊕ must be associative and commutative.
These properties allow us to apply ⊕ to subsets of a result
element’s partial products and to any ordering of them, which
is chaotic by outer product’s nature. If we truly had an ⊕
operation that required seeing all partial products at once, we
would have to either gather partial products at the client or
initiate a full major compaction.
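The effect of associativity and commutativity can be checked with a short Python sketch. The `combine` function is a hypothetical stand-in for a Combiner, `Decimal` stands in for BigDecimal, and the three staged calls only imitate the idea of summing at flush, compaction and scan time; they do not reproduce Accumulo's actual scheduling.

```python
from decimal import Decimal
from functools import reduce
from operator import add
import random

# Partial products for one element of C, in the order they were written.
partials = [Decimal("1.5"), Decimal("2.25"), Decimal("4"),
            Decimal("0.125"), Decimal("10")]

def combine(values):
    """Stand-in for a Combiner: sum whatever values happen to be present."""
    return reduce(add, values)

all_at_once = combine(partials)

# Sum the same values in a different order and in three separate stages.
shuffled = partials[:]
random.Random(0).shuffle(shuffled)
flushed = combine(shuffled[:2])                   # e.g. at a minor compaction
compacted = combine([flushed] + shuffled[2:4])    # e.g. at a major compaction
at_scan = combine([compacted] + shuffled[4:])     # remaining value seen at scan
```

Both routes give the same total, which is what lets the summation be applied to arbitrary subsets of partial products at arbitrary times.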

5) Monitoring: RemoteWriteIterator never emits entries to
the client by default. One downside of this approach is that
clients cannot precisely track progress of a TableMult operation,
which may frustrate users expecting a more interactive
computing experience. Clients could query the Accumulo
monitor for read/write rates or prematurely scan partial products
written to C, but both approaches are too coarse.

We therefore implement a monitoring option that emits
a value containing the number of entries TwoTableIterator
processed at a client-set frequency. RemoteWriteIterator emits
monitoring entries at “safe” points, that is, points at which we
can recover the iterator stack’s state if Accumulo destroys,
re-creates and re-seeks it. Stopping after emitting the last value
in the outer product of two rows is safe because we place 
the last value’s row key in the monitoring key and know, in 
the event of an iterator stack rebuild, to proceed to the next 
matching row. We may succeed in stopping during an outer 
product by encoding more information in the monitoring key. 

III. Performance 

We evaluate TableMult with two variants of an experiment. 
First we measure the rate of computation as problem size 
increases. We define problem size as number of rows in 
random input graphs represented as adjacency tables and 
rate of computation as number of partial products processed 
per second. Second we repeat the experiment for a fixed 
size problem with all tables split into two tablets, allowing 
Accumulo to scan and write to them in parallel. 

We compare Graphulo TableMult performance to D4M [12] 
as a baseline because a user with one client machine’s best
alternative is reading input graphs from Accumulo, multiplying
them at the client, and inserting the result back into Accumulo. 

D4M stores tables as Associative Array objects in Matlab. 
Because Assoc Array multiplication runs fast by calling
Matlab’s in-memory sparse matrix functions, D4M bottlenecks
on reading data from Accumulo and especially on writing 
back results, despite its capacity for high speed Accumulo 
reads and writes [14]. We consequently expect TableMult to 
outperform D4M because TableMult avoids transferring data 
out of Accumulo for processing. 

We also expect TableMult to succeed on larger graph sizes 
than D4M because TableMult uses a streaming outer product 
algorithm that does not store input tables in memory. An 
alternative D4M implementation would mirror TableMult’s 
streaming outer product algorithm, enabling D4M to run on 
larger problem sizes at potentially worse performance. We 
therefore imagine the whole-table D4M algorithm as an upper 
bound on the best performance achievable when multiplying 
Accumulo tables outside Accumulo’s infrastructure. 

We use the Graph500 unpermuted power law graph generator
[15] to create random input tables, such that both tables’
first rows have high degree (number of columns) and subsequent
rows exponentially decrease in degree. The common
power law structure correlates the input tables, which leads
to denser result tables than if we were to permute the input
tables but does not otherwise affect multiplication behavior.
The generator takes SCALE and EdgesPerVertex parameters,
creating graphs with 2^SCALE rows and EdgesPerVertex ×
2^SCALE entries. We fix EdgesPerVertex to 16 and use SCALE
to vary problem size.
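For orientation, the nominal input sizes implied by these parameters can be tabulated with a few lines of Python. The helper `graph_size` is hypothetical, and the counts are the nominal 2^SCALE rows and EdgesPerVertex × 2^SCALE entries, not measured table sizes.

```python
def graph_size(scale, edges_per_vertex=16):
    """Nominal (rows, entries) of one generated input graph."""
    rows = 2 ** scale
    return rows, edges_per_vertex * rows

# Smallest, largest-with-D4M and largest SCALE values used in the experiments.
sizes = {scale: graph_size(scale) for scale in (10, 15, 18)}
```

At SCALE 18 each input table nominally holds about 4.2 million entries across 262,144 rows.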

The following procedure outlines our performance experiment
for a given SCALE and either one or two tablets.

1) Generate two graphs with different random seeds and 
insert them into Accumulo as adjacency tables via D4M. 

2) In the case of two tablets, identify an optimal split point 
for each input graph and set the input graphs’ table splits 
equal to that point. “Optimal” here means a split point 
that evenly divides an input graph into two tablets. 

3) Create an empty output table. For two tablets, pre-split
it with an optimal input split position recorded from a 
previous multiplication run. 

4) Compact the input and output tables so that Accumulo 
redistributes the tables’ entries into the assigned tablets. 

5) Run and time Graphulo TableMult multiplying the transpose
of the first input table with the second.

6) Create, pre-split and compact a new result table for the 
D4M comparison as in steps 3 and 4.

7) Run and time the D4M equivalent of TableMult: 

a) Scan both input tables into D4M Associative Array 
objects in Matlab memory. 

b) Convert the string values from Accumulo into 
numeric values for each Associative Array. 

c) Multiply the transpose of the first Associative 
Array with the second. 

d) Convert the result Associative Array back to String 
values and insert them into Accumulo. 

We conducted the experiments on an Ubuntu Linux laptop
with 16GB RAM and two dual-core Intel i7 processors. Using 
single-instance Accumulo 1.6.1, Hadoop 2.6.0 and ZooKeeper 
3.4.6, we allocated 2GB of memory to an Accumulo tablet 
server initially (allowing growth in 500MB steps), 1GB for 
native in-memory maps and 256MB for data and index cache. 

We chose not to use more than two tablets per table because 
more threads would run than the laptop could handle. Each 
additional tablet can potentially add the following threads: 

1) Table Aᵀ server-side scan thread;

2) Table Aᵀ client-side scan thread,
running from RemoteSourceIterator;

3) Table B server-side scan/multiply thread,
running a TableMult iterator stack;

4) Table B client-side scan thread,
running from the initiating client, mostly idle;

5) Table C server-side write thread;

6) Table C client-side write thread,
running from RemoteWriteIterator; and

7) Table C server-side minor compaction threads,
running with a Combiner implementing ⊕.

We show table C sizes and experiment timings in Table I
and plot them in Figure 2. We could not run the D4M
comparison past SCALE 15 because C does not fit in memory.

Fig. 2: TableMult Processing Rate vs. Input Table Size

For the scaled problem, the best results we could achieve 
are flat horizontal lines, indicating that we maintain the same 
level of operations per second as problem size increases. 

One reason we see a downward rate trend at larger problem 
sizes is that Accumulo needs to minor compact table C in the 
middle of a TableMult. The compactions trigger flushes to disk 
along with the ⊕ Combiner that sums partial products written
to C so far, neither of which we include in rate measurements. 

For the fixed size problem, the best results we could achieve 
are two-tablet rates at double the one-tablet rates at every 
problem size. Our experiment shows that Graphulo two-tablet 
rates perform up to 1.5x better than one-tablet rates at lower 
SCALES. We attribute TableMult’s shortfall to high processor 
contention for the laptop’s four cores as a result of the 14 
threads that may run concurrently when each table has two 
tablets; in fact, processor usage hovered near 100% for all four 
cores throughout the two-tablet experiments. We expect better 
scaling when running our experiment in larger Accumulo 
clusters that can handle more degrees of parallelism. 
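The "up to 1.5x" figure can be recomputed from the Graphulo rate columns of Table I. In the sketch below the rates are transcribed by hand, and the power of ten is read as 10^5 partial products per second, an assumption chosen to agree with the write rate near 400,000 per second quoted in the conclusions; the variable names are illustrative.

```python
scales = [10, 11, 12, 13, 14, 15, 16, 17, 18]

# Graphulo rates in partial products per second, transcribed from Table I.
one_tablet = [2.81e5, 3.04e5, 3.10e5, 2.99e5, 2.90e5, 2.93e5, 2.88e5, 2.67e5, 2.42e5]
two_tablets = [3.98e5, 4.55e5, 4.18e5, 3.93e5, 3.87e5, 3.74e5, 3.40e5, 2.94e5, 2.58e5]

# Speedup of the two-tablet run over the one-tablet run at each SCALE.
speedup = {s: two / one for s, one, two in zip(scales, one_tablet, two_tablets)}
best_scale = max(speedup, key=speedup.get)
```

With these values the largest ratio occurs at SCALE 11 and is just under 1.5, and the ratio shrinks toward 1 at the largest problem sizes, well short of the ideal factor of 2.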

IV. Discussion 

A. Related Work 

Buluç and Gilbert studied message passing algorithms for
SpGEMM such as Sparse SUMMA, most of which use 2D
block decompositions [16]. Unfortunately, 2D decompositions
are difficult in Accumulo and message passing even more so.
In this work, we use Accumulo’s native 1D decomposition
along rows and do not rely on tablet server communication 
apart from shuffling partial products of C via BatchWriters. 

Our outer product method could have been implemented 
on Hadoop MapReduce or its successor YARN [17]. There is 
a natural analogy from TableMult to MapReduce: the map 
phase scans rows from Aᵀ and B and generates a list of
partial products from TwoTableIterator; the shuffle phase sends
partial products to correct tablets of C via BatchWriters;
and the reduce phase sums partial products using Combiners. 
Examining the conditions on which MapReduce reading from 
and writing to Accumulo’s RFiles directly can outperform 
Accumulo-only solutions is worthy future work. 

A common Accumulo pattern is to scan and write from multiple
clients in parallel; in fact, researchers obtained considerably
high insert rates using parallel client strategies [14]. We
chose to build Graphulo as a service within Accumulo instead 
of assuming a multiple client capability, such that Graphulo is 
as accessible as possible to diverse client environments. 

The strategy in [14] also used tablet location information to 
determine where clients could write locally. Knowing tablet- 
to-tablet-server assignment could likewise aid Graphulo, not 
only to minimize network traffic but also to partly eliminate 
Apache Thrift RPC serialization, which prior work has shown 
is a bottleneck for scans when iterator processing is light [18]. 
Such an enhancement would access a local tablet server by 
method call in place of Scanners and BatchWriters. 

The Knowledge Discovery Toolkit (KDT) distributed-memory
Python graph library offers sparse matrix multiplication
in a similar design as Graphulo’s [19]. Both support
custom addition, multiplication and filter operators written in
a high level language. They differ in that Graphulo targets the
Accumulo infrastructure which is I/O-bound, in contrast to the
KDT which is compute-bound. Graphulo therefore gains less
from code generation techniques on its Java iterator kernels,
whereas the KDT uses the SEJITS technique [20] to translate
Python kernels into C++ for callback by KDT’s underlying
Combinatorial BLAS library [21], thereby raising performance
from compute- to memory bandwidth-bound at the expense of
restricting operator expressiveness to a DSL.


B. Design Alternative: Inner-Outer Product Hybrid 

It is worth reconsidering the inner product method from our 
initial design because it has an opposite performance profile as 
Figure 3’s left and right depict: inner product bottlenecks on 
scanning whereas outer product bottlenecks on writing. At the 
expense of multiple passes over input matrices, inner product 
emits partial products in order and immediately pre-summable, 
reducing the number of entries written to Accumulo to the 
minimum possible. Outer product reads inputs in a single pass 
but emits entries out of order and has little chance to pre-sum,
instead writing individual partial products to C. Table I
quantifies that outer product writes 2.5 to 3 times more entries 
than inner product for power law inputs. In the worst case, 
multiplying a fully dense N x M with an M x L matrix, 
outer product emits M times more entries than inner product. 
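The dense worst case follows from counting loop iterations: outer product emits one partial product per (k, i, j) triple, while a fully pre-summed inner product emits one entry per (i, j) pair. A brute-force count in Python (the helper `emitted_entries` and the sizes are illustrative) confirms the factor of M.

```python
def emitted_entries(N, M, L):
    """Count entries written when multiplying dense N x M and M x L matrices."""
    outer = sum(1 for k in range(M) for i in range(N) for j in range(L))
    inner = sum(1 for i in range(N) for j in range(L))
    return outer, inner

outer_count, inner_count = emitted_entries(N=3, M=4, L=5)
ratio = outer_count // inner_count
```

For sparse inputs the ratio is smaller because most (k, i, j) triples involve a zero operand and produce nothing.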

TABLE I: Output Table C Sizes and Experiment Timings

       Entries in Table C            Graphulo 1 Tablet         D4M 1 Tablet              Graphulo 2 Tablets        D4M 2 Tablets
SCALE  PartialProducts  AfterSum     Time (s)     Rate (pp/s)  Time (s)     Rate (pp/s)  Time (s)     Rate (pp/s)  Time (s)     Rate (pp/s)
10     8.05×10^5        2.69×10^5    2.87         2.81×10^5    3.02         2.67×10^5    2.02         3.98×10^5    2.80         2.87×10^5
11     2.36×10^6        8.15×10^5    7.76         3.04×10^5    8.80         2.68×10^5    5.19         4.55×10^5    8.72         2.71×10^5
12     6.82×10^6        2.43×10^6    2.20×10^1    3.10×10^5    2.66×10^1    2.56×10^5    1.63×10^1    4.18×10^5    2.62×10^1    2.60×10^5
13     1.91×10^7        7.04×10^6    6.40×10^1    2.99×10^5    1.50×10^2    1.27×10^5    4.86×10^1    3.93×10^5    1.44×10^2    1.33×10^5
14     5.27×10^7        2.00×10^7    1.82×10^2    2.90×10^5    5.79×10^2    9.09×10^4    1.36×10^2    3.87×10^5    5.59×10^2    9.42×10^4
15     1.47×10^8        5.83×10^7    5.03×10^2    2.93×10^5    2.51×10^3    5.86×10^4    3.94×10^2    3.74×10^5    2.56×10^3    5.75×10^4
16     4.00×10^8        1.63×10^8    1.39×10^3    2.88×10^5    -            -            1.18×10^3    3.40×10^5    -            -
17     1.09×10^9        4.59×10^8    4.06×10^3    2.67×10^5    -            -            3.70×10^3    2.94×10^5    -            -
18     2.94×10^9        1.28×10^9    1.21×10^4    2.42×10^5    -            -            1.14×10^4    2.58×10^5    -            -

Is it possible to blend inner and outer product SpGEMM
methods, choosing a middle point in Figure 3 with equal read
and write bottlenecks for overall greater performance? In the
following generalization, partition parameter P varies behavior
between inner product at P = N and outer product at P = 1:

    for p = 1:P
      for k = 1:M
        for i = ((p-1)N/P + 1) : (pN/P)
          for j = 1:L
            C(i,j) ⊕= A(i,k) ⊗ B(k,j)
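A dense, in-memory Python sketch of the hybrid loop nest may help show how P trades passes over B against write locality. It is a simplification under stated assumptions: the function `hybrid_multiply` is hypothetical, indices are 0-based, P is required to divide N, and ordinary + and × play the roles of ⊕ and ⊗.

```python
def hybrid_multiply(A, B, P):
    """A: N x M, B: M x L as nested lists; P must divide N in this sketch.
    Returns (C, passes_over_B) using ordinary + and *."""
    N, M, L = len(A), len(B), len(B[0])
    assert N % P == 0, "sketch assumes P divides N"
    block = N // P
    C = [[0] * L for _ in range(N)]
    passes_over_B = 0
    for p in range(P):
        passes_over_B += 1                    # each p re-reads every row of B
        for k in range(M):
            # Only rows of C in the current block of N/P rows are written.
            for i in range(p * block, (p + 1) * block):
                if A[i][k] == 0:
                    continue                  # zero operand: no partial product
                for j in range(L):
                    if B[k][j] != 0:
                        C[i][j] += A[i][k] * B[k][j]
    return C, passes_over_B

A = [[1, 2], [0, 3], [4, 0], [0, 5]]          # N = 4, M = 2
B = [[4, 5, 0], [0, 6, 7]]                    # M = 2, L = 3
results = {P: hybrid_multiply(A, B, P) for P in (1, 2, 4)}
```

Every choice of P yields the same C; only the number of passes over B and the size of the block of C being written at any moment change.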


The hybrid algorithm runs P passes through B, each of
which has write locality to a vertical partition of C of size
N/P × L. Pre-summing ability likewise varies inversely with
P, though actual pre-summing depends on A and B’s sparsity
distribution as well as how many positions of C the TableMult
iterators cache. Figure 3’s center depicts the P = 2 case.

A challenge for any hybrid algorithm is mapping it to 
Accumulo infrastructure. We chose outer product because it 
more naturally fits Accumulo, using iterators for one-pass 
streaming computation, BatchWriters to handle unsorted entry 
emission and Combiners to defer summation. The above 
hybrid algorithm resembles 2D block decompositions, and so 
maximizing its performance may be challenging given limited 
data layout control and unknown data distribution. 

Nevertheless, possible design criteria are to select a small 
P to minimize passes through B, while also choosing P large 
enough so that ⌈NL/P⌉ entries fit in memory (dense matrix
worst case), which guarantees complete pre-summing. The 
latter criterion may be relaxed with decreasing matrix density. 
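Under the dense worst case, the two criteria reduce to picking the smallest P for which ⌈NL/P⌉ does not exceed the number of entries that fit in memory. The Python sketch below is one way to compute that; the function name and the `capacity` parameter (entries that fit in memory) are assumptions for illustration.

```python
import math

def smallest_partition_count(N, L, capacity):
    """Smallest P with ceil(N*L/P) <= capacity, capped at N (pure inner product).
    If the cap applies, the memory guarantee may no longer hold."""
    P = -(-N * L // capacity)                 # ceiling division of N*L by capacity
    return min(max(P, 1), N)

P = smallest_partition_count(N=1000, L=1000, capacity=300_000)
fits = math.ceil(1000 * 1000 / P) <= 300_000
one_fewer_fits = math.ceil(1000 * 1000 / (P - 1)) <= 300_000
```

For sparser matrices fewer positions of C are actually populated, so a smaller P than this bound may still allow complete pre-summing.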


C. TableMult in Algorithms 

Several optimization opportunities exist for TableMult as a
primitive in larger algorithms. Given row A of starting vertices
and graph adjacency matrix B, suppose we wish to union the
vertices reached in two steps from those in A into A via
the program C = AB; D = CB; A ⊕= D, as one way of
calculating A ⊕= AB² via TableMult calls. Such calculations
are useful for finding vertices reachable in an even number of
steps. We would save two round trips to disk if we could mark
C and D as “temporary tables,” i.e. tables intermediate to an
algorithm that should be held in memory and not written to
Hadoop if possible. Combiners in TableMult do enable one
optimization: summing CB into A directly by rewriting the
program as C = AB; A ⊕= CB.
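The rewritten program can be traced on a toy graph in Python. The sketch assumes a Boolean-style semiring in which ⊗ is logical AND and ⊕ is set union, so a row vector becomes a set of vertices; the function `step` and the four-vertex graph are hypothetical.

```python
def step(frontier, adjacency):
    """Row vector times adjacency matrix over the (OR, AND) semiring."""
    reached = set()
    for vertex in frontier:
        reached |= adjacency.get(vertex, set())
    return reached

# Adjacency matrix B as out-neighbor sets; A is the set of starting vertices.
B = {"a": {"b"}, "b": {"c", "d"}, "c": {"a"}, "d": set()}
A = {"a"}

C = step(A, B)            # C = AB: vertices one step from A
A = A | step(C, B)        # A ⊕= CB: fold two-step vertices directly into A
```

No separate table D is materialized: the second product is accumulated straight into A, which mirrors what the Combiner on the result table makes possible.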

A pipelining optimization streams entries from a TableMult 
to computations taking its result as input. Outer product 
pipelining is difficult because it cannot guarantee writing every 
partial product for a particular element to C until it finishes, 
whereas inner product’s complete pre-summing emits elements 
ready for use downstream. More ambitiously, loop fusion 
merges iterator stacks for successive computations into one. 

Optimizing computation on NoSQL databases is challenging
in the general case because NoSQL databases typically
avoid query planner features customary of SQL databases in
exchange for raw performance. NewSQL databases aim in part
to achieve the best of both worlds: performance and query
planning [22]. We aspire to make a small step for Accumulo
in the direction of NewSQL with current Graphulo research. 

V. Conclusions 

In this work we showcase the design of TableMult, a
Graphulo server-side implementation of the SpGEMM GraphBLAS
matrix math kernel in the Accumulo database. We
compare inner and outer product approaches and show how
outer product better fits Accumulo’s iterator model. The
implementation shows excellent single node performance, achieving
write rates near 400,000 per second, which is consistent with
the single node peak write rate for Accumulo [14]. Performance
experiments show good scaling for scaled problem sizes
and suggest good scaling for fixed size problems, but these
require additional experiments on a larger cluster to confirm.

In addition to topics from Section IV’s discussion, future
research efforts include implementing the remaining GraphBLAS
kernels, developing graph algorithms that use the Graphulo
library and delivering to the Accumulo community.